From Recurrence to Attention: Overcoming the Limits of Sequential Models

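Traditional sequence modeling relied heavily on recurrent neural networks (RNNs) and their gated variants (LSTM, GRU). Although these models were groundbreaking for early sequence-to-sequence tasks, they run into fundamental scalability problems when handling long-range dependencies. The introduction of the attention mechanism overcame these constraints, a key conceptual leap that made modern, highly efficient NLP systems possible.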

1. The Long-Range Dependency Problem

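In an RNN, the dependency path between token $t_i$ and token $t_j$ must pass sequentially through every intermediate step. During backpropagation, the gradient signal is therefore multiplied by the weight matrix many times, which causes vanishing gradients: the signal decays rapidly with distance. As a result, it becomes nearly impossible to carry useful information or error signals across long distances. The complexity of this path is $O(N)$.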
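The decay described above can be illustrated with a minimal sketch. It assumes a simplified linear recurrence (the activation derivative is ignored) and a hypothetical recurrent weight matrix whose singular values all equal `scale`; the function name `gradient_norms` and its parameters are illustrative, not taken from any library.

```python
import numpy as np

def gradient_norms(num_steps, hidden_size=16, scale=0.5, seed=0):
    """Norm of a backpropagated signal after each step of a linear recurrence."""
    rng = np.random.default_rng(seed)
    # Assumption: W is an orthogonal matrix times `scale`, so every
    # singular value of W equals `scale`.
    q, _ = np.linalg.qr(rng.standard_normal((hidden_size, hidden_size)))
    W = scale * q
    # Unit-norm error signal arriving at the last time step.
    grad = np.ones(hidden_size) / np.sqrt(hidden_size)
    norms = []
    for _ in range(num_steps):
        grad = W.T @ grad  # one step of backpropagation through time
        norms.append(float(np.linalg.norm(grad)))
    return norms

norms = gradient_norms(num_steps=50)
print(norms[0], norms[9], norms[49])
```

Under these assumptions the norm shrinks by a factor of `scale` at every step, so a signal that must cross 50 steps is reduced by `scale ** 50`. With `scale` above 1 the same loop would show the opposite failure, exploding gradients.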

2. The Fixed-Size Context Bottleneck

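Traditional encoder-decoder architectures had to compress the meaning of the entire source sequence, regardless of its length, into a single fixed-dimensional vector (the context vector, $C$). This bottleneck severely limits the model's ability to retain all the necessary information, especially for long or complex inputs, and causes significant information loss at the decoding stage.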
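As a rough illustration of the bottleneck, the sketch below runs a plain tanh RNN encoder over a short and a long source sequence and keeps only the final hidden state as $C$. The function `encode`, the weight names `W_x` and `W_h`, and all dimensions are assumptions chosen for the example; the weights are random rather than trained.

```python
import numpy as np

def encode(tokens, W_x, W_h):
    """Run a plain tanh RNN over `tokens` and return only the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x in tokens:
        h = np.tanh(W_x @ x + W_h @ h)
    return h  # the context vector C

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 4  # hypothetical sizes
W_x = rng.standard_normal((hidden_dim, embed_dim)) * 0.1
W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1

short_source = rng.standard_normal((5, embed_dim))    # 5 tokens
long_source = rng.standard_normal((500, embed_dim))   # 500 tokens

C_short = encode(short_source, W_x, W_h)
C_long = encode(long_source, W_x, W_h)
print(C_short.shape, C_long.shape)  # (4,) (4,)
```

Both sequences end up in a vector of `hidden_dim` numbers, so the 500-token input has to share the same capacity as the 5-token one. That is the compression point the decoder depends on.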

Conceptual Representation

RNN Context Bottleneck

A visualization illustrating the traditional RNN Encoder-Decoder structure where the sequence is compressed into a single, fixed-size vector before being passed to the decoder. This point of compression often results in the loss of fine-grained information required for accurate long-sequence translation.

[Figure: RNN encoder-decoder showing the context vector bottleneck]

Question 1

Why is the dependency path length in a standard RNN considered a major limitation for long sequences?

Path complexity is $O(1)$.

Path complexity is $O(N^2)$.

Path complexity is $O(N)$, causing vanishing gradients.

It prevents the use of LSTMs.

Question 2

In pre-Attention Seq2Seq models, what component represents the 'information bottleneck'?

The softmax layer.

The recurrent cell (e.g., GRU).

The fixed-size context vector derived from the encoder's final hidden state.

The input embedding layer.

Challenge: Conceptualizing Attention's Advantage

Comparing Structural Complexity

Consider a sequence of length $N$. We want to establish a dependency between token $X_i$ and token $Y_j$.

Contrast the dependency path length required by:

Traditional Recurrence (e.g., LSTM)
Attention Mechanism (Query-Key comparison)

Step 1

How does Attention fundamentally reduce the structural complexity of establishing distant dependencies?

Solution:
Attention creates a direct, non-sequential connection between any output token $Y_j$ and any input token $X_i$ by calculating a weight based on their vector similarity ($Q_j K_i^T$). The dependency path length is effectively $O(1)$ (a direct look-up), removing the constraint of linear path traversal imposed by recurrence ($O(N)$).
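The direct look-up can be sketched as follows. This is a minimal example under stated assumptions: keys are random unit vectors, the decoder query `Q_j` is hypothetically constructed to align with the key at one distant position `i`, and the weights are a softmax over the unscaled dot products $Q_j K_i^T$. The helper `attention_weights` is illustrative, not a library function.

```python
import numpy as np

def attention_weights(query, keys):
    """Softmax over the dot-product similarity between one query and all keys."""
    scores = keys @ query            # one score per input position
    scores = scores - scores.max()   # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

rng = np.random.default_rng(0)
N, d = 1000, 16  # hypothetical sequence length and vector size
K = rng.standard_normal((N, d))
K = K / np.linalg.norm(K, axis=1, keepdims=True)  # unit-norm keys (assumption)
V = rng.standard_normal((N, d))

i = 0                # a position far from the end of the input
Q_j = 10.0 * K[i]    # hypothetical query aligned with the key at position i

w = attention_weights(Q_j, K)
context = w @ V      # weighted sum of values for output position j
print(int(np.argmax(w)), context.shape)
```

The weight for position `i` comes from a single comparison between `Q_j` and `K[i]`, with no intermediate steps between the two positions, which is why the path length does not grow with their distance. Computing the weights for all positions still costs one comparison per input token.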